This script analyses gene set enrichment results from running Gowinda. I highlight cases of nominal (FDR<=0.1) and significant (FDR<=0.5) enrichment.

Figure explanation

Results from each gene set database tested are given under subtitles. I also write a little about interesting patterns I identify under the graphs with notes on the functions of gene sets and the signal detected.

The bar charts show the FDR for each gene set, the colour of the bars also corresponds to the FDR so shorter bars (i.e. smaller FDRs) are more red just to aid identification of significant results when looking at lots of graphs together. On the right of the graph I also give the expected and observed number of hits to assess the effect size.

Each row corresponds to a different subspecies dataset (all=‘all samples’, ce=‘central and eastern’, n=‘Nigeria-Cameroon’, w=‘western’) and each column corresponds to a different empirical p value threshold (0.5%, 0.1%, 0.05%).

I also plot Venn diagrams to get an understanding of whether enrichment of multiple gene sets are independent signals or not.

Setup

rm(list=ls())
knitr::opts_knit$set(root.dir = '~/OneDrive - University College London/Projects/Ostridge_PanAf/gowinda/baypass_core') 

Library

library(data.table)
options(datatable.fread.datatable=FALSE)
library(dplyr)
library(ggplot2)
library(wesanderson)
library(VennDiagram)
library(UpSetR)
library(tidyverse)
library(ggh4x)
source("scripts/plot_gowinda.R")

gowinda_out_dir="output/gowinda_output/"
stat='fpr'
subsps=c('all', 'ce', 'n', 'w')

if(stat=='p'){
  tails=c('0.5pct-cov_cor', '0.1pct-cov_cor', '0.05pct-cov_cor')
  stat_name="Empirical tails"
}else if(stat=='fpr'){
  tails=c('fpr0.5pct.non-genic_1000bp.flanks', 'fpr0.1pct.non-genic_1000bp.flanks', 'fpr0.05pct.non-genic_1000bp.flanks')
  stat_name="Candidate FPR tails"
}else{
  stop("Define stat as either 'p' (empircical p value) or fpr (FPR)")
}

Define Function

plot_gowinda_all=function(subsps, tails, gene.set){
  for(subsp in subsps){
    cat("-", subsp, "\n")
    for(tail in tails){
      gowinda_output_file=paste0(gowinda_out_dir, "/",subsp,"/",subsp,".",tail,".gowinda_out.", gene.set)
      if(file.exists(gowinda_output_file) & file.size(gowinda_output_file) > 0){
        df=fread(gowinda_output_file)
        if(nrow(df>0)){plot_gowinda(df, title=paste0(subsp, ": ", "\n", tail, " candidates"))}
      }
    }
  }
}

Gene Function Databases

GO

## - all

## - ce

## - n

## - w

Central and Eastern

Transmission_across_Electrical_Synapses and Electric_Transmission_Across_Gap_Junctions

Signal

  • Nominal enrichment at 0.1% (FDR=0.11)
## ce go fpr0.1pct.non-genic_1000bp.flanks

Reactome

## - all

## - ce

## - n

## - w

Central and Eastern

Unfolded protein response

Signal

  • Nominal enrichment at 0.5% (FDR=0.08)

GWAS

## - all

## - ce

## - n

## - w

KEGG

## - all

## - ce

## - n

## - w

Tissue expression data

## - all

## - ce

## - n

## - w

Western

Spleen

Signal

  • Significant enrichment at 0.1% and 0.05%
## w expr fpr0.1pct.non-genic_1000bp.flanks

Tissue expression data - semi-unique

## - all

## - ce

## - n

## - w

Phenotype database

## - all

## - ce

## - n

## - w

All subspecies

Pathogen Datasets

We do not really see any strong signals when running hypothesis free tests using large gene function databases. Our previous work has also pathogens to be important section pressures in chimps at different time scales and so we decided to run some more hypothesis driven tests, investigating enrichment of genes we know are important in pathogen infection.

Viruses

VIPs

## - all

## - ce

## - n

## - w

Function

  • Manually curated and highthroughoput list of genes which code for proteins which physically bind with viruses.

Signal

  • No significant enrichment.

Hypothesis Driven Pathogens (including ebel 2017)

## - all

## - ce

## - n

## - w

Central-eastern

HSV-1

Signal
  • Enriched at 0.5% tails (FDR=0.05)

Western

COVID genes

Signal
  • Enriched at 0.05% tail (FDR=0.05)
## w pathogen fpr0.05pct.non-genic_1000bp.flanks

  • NB: a potential confounding factor is that gene aliases may reduce the amount of overlap reported in the above figure but I would expect this effect to be rather small.

Immunity genes

## - all

## - ce

## - n

## - w

Dehydration genes

## - all

## - ce

## - n

## - w

Heat plot

datasets=c("go", "kegg", "reac", "gwas", "expr", "phen", "vips", "pathogen_ebel2017", "imm", "dehy")

#```{r echo=FALSE, out.width=“100%”, fig.height=3.5, fig.width=5.5}

## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Min genes per threshold:  0

dehydration genes

## Min genes per threshold:  0